{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "This notebook describes the MATLAB package that implements the leave-out correction of <a href=\"https://eml.berkeley.edu/~pkline/papers/KSS2020.pdf\" target=\"_blank\"> Kline, Saggio and Sølvsten (2020)</a> -- KSS henceforth --  for two-way fixed effects models.\n",
    "\n",
    "\n",
    "# Introduction\n",
    "Economists often study settings where units possess two or more group memberships, some of which can change over time. A prominent example comes from Abowd, Kramarz, and Margolis (1999) (henceforth AKM) who proposed a panel model of log wage determination that is additive in worker and firm fixed effects. This so-called “two-way” fixed effects or \"AKM\" model takes the form:\n",
    "\\begin{equation}\n",
    "\ty_{{g} t} =  \\alpha_{{g}} + \\psi_{j({g},t)} + w_{{g} t}'\\delta +  \\varepsilon_{{g} t}  \\qquad({g}=1,\\dots,N, \\ t=1,\\dots,T_{g} \\ge 2)\n",
    "\\end{equation}\n",
    "where the function $j(\\cdot ,\\cdot ):\\lbrace 1,\\dots ,N\\rbrace \\times \\lbrace 1,\\dots ,\\max_g T_g \\rbrace \\to \\lbrace 0,\\dots ,J\\rbrace$  allocates each of $n=\\sum_{g=1}^N T_g$ person-year observations to one of $J+1$ firms. Here $\\alpha_g$ is a person effect, $\\psi_{j(g,t)}$ is a firm effect, $w_{gt}$ is a time-varying covariate, and $\\varepsilon_{gt}$ is a time-varying error. \n",
    "\n",
    "Note that can we simply rewrite the original AKM model as\n",
    "\\begin{equation}\n",
    "y_i =x_i^{\\prime } \\beta +\\varepsilon_i \\qquad i=1,...,n\n",
    "\\end{equation}\n",
    "\n",
    "where $i$ indexes a particular person $g$ year $t$ observation, $x_i$ is a vector that collects all the worker, firm dummies as well as the time-varying covariates $w_{gt}$so that $\\beta =(\\alpha ,\\psi ,\\delta )$' is a $k\\times 1$vector that collects all the worker, firm fixed effects along with $\\delta$.\n",
    "\n",
    "Interest in AKM models often centers on understanding how much of the variability in log wages is attributable to firms. One can summarize the firm contribution to wage inequality via the following two variance components parameters:\n",
    "\n",
    "\\begin{equation}\n",
    "\\sigma_{\\psi }^2 =\\frac{1}{n}\\sum_{g=1}^N \\sum_{t=1}^{T_g } {\\left(\\psi_{j\\left(g,t\\right)} -\\bar{\\psi} \\right)}^2 \\qquad \\text{and }\\sigma_{\\alpha ,\\psi } =\\frac{1}{n}\\sum_{g=1}^N \\sum_{t=1}^{T_g } \\left(\\psi_{j\\left(g,t\\right)} -\\bar{\\psi} \\right)\\alpha_g \t\t\t\n",
    "\\end{equation}\n",
    "\n",
    "where $\\bar{\\psi} =\\frac{1}{n}\\sum_{g=1}^N \\sum_{t=1}^{T_g } \\psi_{j(g,t)}$. The variance component $\\sigma_{\\psi }^2$ measures the contribution of firm wage variability to inequality, while the covariance component $\\sigma_{\\alpha ,\\psi }$ measures the additional contribution of systematic sorting of high wage workers to high wage firms. \n",
    "\n",
    "The function `leave_out_KSS` provides unbiased estimates of $\\sigma_{\\psi }^2$ and $\\sigma_{\\alpha ,\\psi }$ as well as an estimate of $\\sigma_{\\alpha }^2 =\\frac{1}{n}\\sum_{g=1}^N \\sum_{t=1}^{T_g } {\\left(\\alpha_g -\\bar{\\alpha} \\right)}^2$ using the leave-out bias correction approach proposed by KSS. In what follows, we use the words \"workers\" and \"firms\" when describing the procedure for simplicity but the the function `leave_out_KSS` can be applied to any two-way models (e.g. patients and doctors, students and teachers, etc).\n",
    "\n",
    "# Running the KSS Correction\n",
    "We now demonstrate the functioning of `leave_out_KSS` with a simple example. \n",
    "\n",
    "## Setup\n",
    "We begin with some auxiliary lines of code that define the relevant paths, call the CMG package developed by Yiannis Koutis and set-up the parallel environment within Matlab. \n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "\n",
      "a =\n",
      "\n",
      "     4\n",
      "\n",
      "Starting parallel pool (parpool) using the 'local' profile ...\n",
      "Connected to the parallel pool (number of workers: 6).\n"
     ]
    }
   ],
   "source": [
    "%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%\n",
    "                     %Set-up                 \n",
    "%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%\n",
    "clc\n",
    "clear\n",
    "cd '/Users/raffaelesaggio/Dropbox/LeaveOutTwoWay'\n",
    "path(path,'codes'); %this contains the main LeaveOut Routines.\n",
    "path(path,'CMG'); % CMG package http://www.cs.cmu.edu/~jkoutis/cmg.html\n",
    "%[result,output] = evalc('installCMG(1)'); %installs CMG routine (silently)\n",
    "%delete(gcp(\"nocreate\")) %clear parallel envir.\n",
    "%c = parcluster('local');  %tell me # of available cores\n",
    "%nw = c.NumWorkers; %tell me # of available cores\n",
    "%pool=parpool(nw,'IdleTimeout', Inf); %all cores will be assigned to Matlab\n",
    "%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Importing the Data\n",
    "The Github Repo contains a matched employer-employee testing data where we observe the identity of the worker, the identity of the firm employing a given worker, the year in which the match is observed (either 1999 or 2001) and the associated log wage.\n",
    "\n",
    "*Important!:* the original data must be sorted by individual identifiers (id). For instance, one can see that the testing data is sorted by individual identifiers (and year, using `xtset id year` in Stata)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%\n",
    "                    %IMPORT DATA\n",
    "%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%\n",
    "%Input File\n",
    "namesrc='data/test.csv'; %path to original testing data\n",
    "data=importdata(namesrc); %import data\n",
    "id=data(:,1); %worker identifiers\n",
    "firmid=data(:,2); %firm identifiers\n",
    "y=data(:,4); % outcome variable\n",
    "clear data\n",
    "%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Calling the Main Function\n",
    "\n",
    "The function `leave_out_KSS` relies on three mandatory inputs: (y,id,firmid). We can obtain an unbiased variance decomposition of the associated AKM model by simply calling "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%\n",
    "           %RUN THE KSS LEAVE-OUT CORRECTION        \n",
    "%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%\n",
    "[sigma2_psi,sigma_psi_alpha,sigma2_alpha]  = leave_out_KSS(y,id,firmid);\n",
    "%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Interpreting the Output\n",
    "\n",
    "The code `leave_out_KSS` and associated output is composed by three sections. \n",
    "\n",
    "**Section 1**: Here we provide info on leave-out connected set. This is the largest connected set of firms that remains connected after removing any worker from the associated graph, see Lemma 1 and the  Computational Appendix of Kline, Saggio and Sølvsten (2020) for details. The code provides some summary statistics (e.g. # of movers, # of firms, mean and variance of the outcome, etc) for the  leave-out connected set.\n",
    "\n",
    "**Section 2**: After printing the summary statistics, the code computes the statistical leverages of the AKM model, denoted as $P_{ii}$ , and the variance components weights, $B_{ii}$ . Computation of  represents the main computational bottleneck of the routine. \n",
    "\n",
    "**Section 3**: The code then enters its third, and final, stage where the main results are printed. The code starts by reporting the --- biased --- estimates of the variance components using the plug-in approach described in Kline, Saggio and Sølvsten (2020). Then the code prints the bias corrected variance of firm effects and covariance of worker, firm effects."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# What Does the Code Save?\n",
    "\n",
    "`leave_out_KSS` saves three scalars: the variance of firm effects, the covariance of worker, firm effects and the variance of person effects. \n",
    "\n",
    "`leave_out_KSS` also saves on disk one .csv file. This .csv contains information on the leave-out connected set. This file has 4 columns. First column reports the outcome variable, second and third columns the worker and the firm identifiers (as originally inputted by the user). The fourth column reports the statistical leverages of the AKM model. If the code is reporting a leave-out correction at the match-level, the .csv will be collapsed at the match level. By default, the .csv file is going to be saved in the main directory under the name `leave_out_estimates`. The user can specify an alternative path using the option `filename` when calling `leave_out_KSS`."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Scaling to Large Datasets\n",
    "\n",
    "`leave_out_KSS` can be used on extremely large datasets. The code uses a variant of the random projection method --- denoted as the JLA algorithm in KSS --- to approximate the statistical leverages of the AKM model,  $P_{ii}$ --- which are used to compute the leave-out OLS residual --- and $B_{ii}$ which measure the influence of the squared error terms for a given variance component, that is\n",
    "\n",
    "\\begin{equation}\n",
    "P_{ii} =x_i^{\\prime } S_{xx}^{-1} x_i ,\\qquad B_{ii} =x_i^{\\prime } S_{xx}^{-1} AS_{xx}^{-1} x_i.\n",
    "\\end{equation}\n",
    "\n",
    "where $S_{xx} =\\sum_{i=1}^n x_i x_i^{\\prime }$ and $A$is based upon a particular variance component of interest (see example 3 of KSS).\n",
    "\n",
    "The JLA algorithm provides a stochastic approximation to $\\lbrace P_{ii} ,B_{ii} \\rbrace_{i=1}^n$. The number of simulations underlying the JLA algorithm is governed by the input `simulations_JLA` (which is defined as $p$ in the computational appendix of KSS). Intuitively, more simulations imply a higher accuracy -- but higher computation time --- when estimating $\\lbrace P_{ii} ,B_{ii} \\rbrace_{i=1}^n$. \n",
    "\n",
    "**Note:** The user might want to pre-specify a seed for replicability of results when calling the function `leave_out_KSS`. \n",
    "\n",
    "We now demonstrate the performance of the code on a large dataset\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%\n",
    "           %LEAVE OUT CORRECTION IN A LARGE DATASET            \n",
    "%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%\n",
    "%Read Data\n",
    "websave('large_fake.csv', 'https://www.dropbox.com/s/ny5tef29ij7ran2/large_fake_data.csv?dl=1'); %downloads and saves to disk a fake, large matched employer employee data\n",
    "namesrc='large_fake.csv'; %path to the large data\n",
    "data=importdata(namesrc); %import data \n",
    "id=data(:,1); %worker identifiers\n",
    "firmid=data(:,2); %firm identifiers\n",
    "y=data(:,4); % outcome variable\n",
    "clear data\n",
    "delete('large_fake.csv'); %delete original .csv data from disk\n",
    "\n",
    "%Run Leave Out Correction (50 simulations) \n",
    "type_of_algorithm='JLA'; %run random projection algorithm\n",
    "simulations_JLA=50;\n",
    "[sigma2_psi,sigma_psi_alpha,sigma2_alpha]  = leave_out_KSS(y,id,firmid,[],[],type_of_algorithm,simulations_JLA);\n",
    "%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We can see from the output that the leave-out connected set has almost 14 million person-year observations. The code is able to complete in around 8 minutes (on a 2020 Macbook pro with 6 cores and 16GB of RAM). \n",
    "\n",
    "The computational appendix in KSS shows that the JLA algorithm can cut computation time by a factor of 100 while introducing an approximation error of roughly $10^{-4}$.\n",
    "\n",
    "The current code uses improved estimators of both  and  which are both guaranteed to lie in $[0;1]$. These improved estimators are then combined to derive an unbiased JLA estimator of  provided that $\\frac{n}{p^{4}}=o(1)$, see <a href=\"https://www.dropbox.com/s/i28yvzae2tnp2tl/improved_JLA.pdf?dl=1\" target=\"_blank\">this document</a> for details.  \n",
    "\n",
    "An empirical practice that we recommend is to verify the stability of the estimated variance components when increasing `simulations_JLA`. For instance, we can see that if we double `simulations_JLA` from 50 to 100 and run the code again on the same data"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%\n",
    "    %STABILITY OF ESTIMATES FOR DIFFERENT SIMULATIONS IN JLA          \n",
    "%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%\n",
    "simulations_JLA=100;\n",
    "[sigma2_psi,sigma_psi_alpha,sigma2_alpha]  = leave_out_KSS(y,id,firmid,[],[],type_of_algorithm,simulations_JLA); %check stability of variance components\n",
    "%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We obtain virtually the same variance components as when simulations_JLA=50 while significantly increasing the computational time! If the user does not specify a value for simulations_JLA,the code defaults to simulations_JLA=XXX.\n",
    "\n",
    "We conclude this section by noticing that the user can also calculate an exact version of $\\lbrace P_{ii} ,B_{ii} \\rbrace_{i=1}^n$. This can be done by setting the option `type_of_algorithm` to \"exact\". We strongly recommend to avoid calling the option exact in large datasets. We, therefore, load again the original testing data and then confront the exact and JLA based estimates of the variance components"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%\n",
    "           %EXACT VS JLA           \n",
    "%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%\n",
    "%Read Data\n",
    "namesrc='data/test.csv'; %path to original testing data\n",
    "data=importdata(namesrc); %import data\n",
    "id=data(:,1); %worker identifiers\n",
    "firmid=data(:,2); %firm identifiers\n",
    "y=data(:,4); % outcome variable\n",
    "clear data\n",
    "\n",
    "%Run Leave Out Correction with exact\n",
    "type_of_algorithm='exact'; %run random projection algorithm;\n",
    "[sigma2_psi,sigma_psi_alpha,sigma2_alpha]  = leave_out_KSS(y,id,firmid,[],[],type_of_algorithm);\n",
    "\n",
    "%Run Leave Out Correction with JLA\n",
    "simulations_JLA=100;\n",
    "type_of_algorithm='JLA'; %run random projection algorithm;\n",
    "[sigma2_psi,sigma_psi_alpha,sigma2_alpha]  = leave_out_KSS(y,id,firmid,[],[],type_of_algorithm,simulations_JLA);\n",
    "%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "If the input data has more than 10,000 obs, the code defaults to using the JLA algorithm unless the user specifies type_of_algorithm to \"exact\". "
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Adding Controls\n",
    "\n",
    "We have demonstrated the functioning of `leave_out_KSS` using a simple AKM model with no controls ($w_{gt}=0$). It is easy to add a matrix of controls to the routine. Suppose for instance that we want to add to our AKM model year fixed effects. This can be done as follows"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%\n",
    "           %ADDING CONTROLS           \n",
    "%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%\n",
    "\n",
    "%Read Data\n",
    "namesrc='data/test.csv'; %path to original testing data\n",
    "data=importdata(namesrc); %import data\n",
    "id=data(:,1); %worker identifiers\n",
    "firmid=data(:,2); %firm identifiers\n",
    "year=data(:,3); %year identifier\n",
    "y=data(:,4); % outcome variable\n",
    "clear data\n",
    "\n",
    "%Specify year fixed effects as controls\n",
    "[~,~,controls] = unique(year);\n",
    "controls \t   = sparse((1:size(y,1))',controls',1,size(y,1),max(controls));\n",
    "controls       = controls(:,1:end-1); %to avoid collinearity issues, omit last year fixed effects.\n",
    "\n",
    "%Call KSS with matrix of controls\n",
    "[sigma2_psi,sigma_psi_alpha,sigma2_alpha]  = leave_out_KSS(y,id,firmid,controls);\n",
    "%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "When controls are specified, the code proceeds by partialling them out. That is, it first estimates the AKM model in the leave-out connected set\n",
    "\n",
    "\\begin{equation}\n",
    "y_{gt}=\\alpha_{g}+\\psi_{j(g,t)}+w_{gt}'\\delta+\\varepsilon_{gt}\n",
    "\\end{equation}\n",
    "\n",
    "from which we obtain $\\hat{\\delta}$. We then work with a residualized model where the outcome variable is now defined as $y_{gt}^{new}=y_{gt}-w_{gt}'\\hat{\\delta}$  and project this residualized outcome on worker and firm indicators and report the associated (bias-corrected) variance components. "
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Leaving out a Person-Year Observation vs. Leaving Out a Match\n",
    "\n",
    "By default, the code reports leave-out corrections for the variance of firm effects and the covariance of firm, worker effects that are robust to unrestricted heteroskedasticity and serial correlation of the error term  within a given match (unique combination of worker and firm identifier), see Remark 3 of KSS. We discuss the interpretation of the leave-out corrected variance of person effects when leaving a match out here. \n",
    "\n",
    "The user can specify the function to run the KSS correction when leaving only an observation out using the option `leave_out_level`. When leaving a single, person-year, observation out, the resulting KSS variance components are going to be robust to unrestricted heteroskedasticity only. Below we demonstrate how to compute KSS adjusted variance components when leaving a single (person-year) observation out."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%\n",
    "        %Leaving out a Person-Year Observation vs. Leaving Out a Match           \n",
    "%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%\n",
    "%Call KSS and leave a single person-year observation out\n",
    "leave_out_level='obs';\n",
    "[sigma2_psi,sigma_psi_alpha,sigma2_alpha]  = leave_out_KSS(y,id,firmid,[],leave_out_level);\n",
    "%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "When $T=2$ (i.e the underlying matched employer-employee data spans only two years), as in this example, it turns out that the KSS adjusted variance of firm effects and covariance of firm and worker effects is robust to any arbitrary correlation between $\\varepsilon_{g2}$ and $\\varepsilon_{g1}$. "
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Variance of Person Effects when Leaving Out a Match\n",
    "\n",
    "By leaving a match-out, the code provides a bias correction for the variance of firm effects and covariance of worker, firm effects that is robust to unrestricted hetoreskedasticity and serial correlation within a given match of the error term of the AKM model.\n",
    "\n",
    "However, when performing a bias correction based on leaving a match out, the person effects, , of stayers --- workers that never leave a particular firm ---  are not identified. This implies that we cannot compute an unbiased estimate of  for these individuals. An estimate of  for both stayers and movers is required in order to provide a bias correction for the variance of person effects, see equation (1) and Remark 3 in KSS.\n",
    "\n",
    "The current implementation of the code provides an estimate of $\\sigma^{2}_{gt}$ for stayers that is robust to unrestricted heteroskedasticity while using an estimate of $\\sigma^{2}_{gt}$ for movers that is robust to both unrestricted heteroskedasticity and serial correlation within a match. This allows us to compute an estimate of the variance of person effects (and correlation of worker, firm effects) that is  unbiased in the presence of unrestriced heteroskedatisticity and that represents an upper bound for the variance of person effects in the presence of (positive) serial correlation of $\\varepsilon_{gt}$ within a match (and therefore a lower bound for the correlation in worker, firm effects).   \n",
    "\n",
    "There are several alternatives that the user can explore: \n",
    "\n",
    "\n",
    "1. Estimate a variance decomposition in a sample of movers only: For movers, it is possible to estimate a leave-out bias corrected variance of person effects that is robust to both unrestricted hetoreskedasticity and serial correlation in the error term of the AKM model within a given match. Therefore, one can provide an unbiased variance decomposition of all the three components of the two-way fixed effects model by simply feeding to the function `leave_out_KSS` a movers-only sample.\n",
    "\n",
    "\n",
    "\n",
    "2. Work with a long difference panel: To minimize potential biases  in the variance of person effects due to the presence of serial correlation within a match, one can  work with a \"long-difference\" panel where $T=2$. That is, suppose the original data runs from, say, 1990 to 1996 ($T=7$). One can select only the years 1990 and 1996 ($T=2$) and feed these data to `leave_out_KSS`. The idea here is that with long differences: (i) one is less likely to observe stayers (ii) the risk of observing correlation in the match component between the start (say, 1990) and end of the spell (1996) is minimized.    "
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Regressing Firm Fixed Effects on Observables\n",
    "\n",
    "It is common in empirical applications to regress the fixed effects estimated from the two-way model on some observables characteristics. Using the AKM model again as our leading example, suppose we are interested in the linear projection of the firm effecs, $\\psi_{gt}$, on some observables, $Z_{gt}$:\n",
    "\n",
    "\\begin{equation}\n",
    "\\psi_{j(g,t)}=Z_{gt}'\\gamma+e_{gt}\n",
    "\\end{equation}\n",
    "\n",
    "Typical practice is to estimate $\\gamma$ using a simple regression where the estimated firm effects, $\\hat{\\psi}_{j(g,t)}$, are regressed on $Z_{gt}$\n",
    "\n",
    "\\begin{equation}\n",
    "\\hat{\\gamma}=\\left(\\sum_{g,t}Z_{gt}Z_{gt}'.\n",
    "\\right)^{-1}Z_{gt}\\hat{\\psi}_{gt}\n",
    "\\end{equation}\n",
    "\n",
    "KSS show that inference on $\\hat{\\gamma}$ needs to be adjusted. The reason is that the firm fixed effects $\\hat{\\psi}_{j(g,t)}$ are all potentially correlated with one another. To see this, suppose that we have a simple AKM model with only two time periods, set $w_{gt}=0$, and take first differences $\\Delta y_{g}\\equiv y_{g2}-y_{g1}$  to eliminate the worker fixed effects so that the AKM model becomes\n",
    "\\begin{equation}\n",
    "\\Delta y_{g}=\\Delta f_{g}'\\psi+\\varepsilon_{g}\n",
    "\\end{equation}\n",
    "\n",
    "where $\\Delta f_{g}=f_{g,2}-f_{g,1}$ and $f_{gt}=\\{\\mathbf{1}_{j(g,t)=1},..,\\mathbf{1}_{j(g,t)=J}\\}$ is the vector containing the firm dummies. In this model,\n",
    "\n",
    "\\begin{equation}\n",
    "\\hat{\\psi}=\\psi+\\underbrace{\\sum_{g=1}^{N}(\\Delta f_{g}\\Delta f_{g}')^{-1}\\Delta f_{g}\\varepsilon_{g}}_{\\text{Correlated Noise}}\n",
    "\\end{equation}\n",
    "\n",
    "Notice how the dependence in the vector of estimated firm fixed effects, $\\hat{\\psi}$, is induced by the design $\\sum_{g=1}^{N}(\\Delta f_{g}\\Delta f_{g}')^{-1}$.  Intuitively, conducting inference on $\\hat{\\gamma}$ is challenging as we have to provide standard errors in a context where all of our underlying observations are potentially correlated with one another (aka need cluster robust inference when there is only one giant cluster!).\n",
    "\n",
    "The package provides the correct standard errors on $\\hat{\\gamma}$ using the function `lincom_KSS` which is designed like the Stata function `lincom` and therefore works as a post-estimation command. We demonstrate the functioning of `lincom_KSS` with an example.\n",
    "\n",
    "In this example, we are interested in testing whether the difference in person-year weighted mean firm effects between region 1 and region 2 is statistically different from zero. This amounts to running a regression where the depedent variable is the vector of estimated firm effects and the set of observables, $Z_{gt}$ , here is represented by a constant and a dummy for whether the firm of worker $g$ in year $t$ belongs to region 2.\n",
    "\n",
    "The resulting coefficient (and standard error) can be computed by calling the function `leave_out_KSS` specifying that we want to run the `lincom` option and using the region dummy as $Z_{gt}$ (the constant is automatically added by the code)."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%\n",
    "           %REGRESSING FIRM EFFECTS ON OBSERVABLES           \n",
    "%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%\n",
    "\n",
    "%Import data \n",
    "namesrc='data/lincom.csv'; %testing data for the lincom function\n",
    "data=importdata(namesrc);\n",
    "id=data(:,1); \n",
    "firmid=data(:,2);\n",
    "y=data(:,5);\n",
    "region=data(:,4); %Region indicator. Value -1 for region 1, Value 1 for region 2;\n",
    "region_dummy=region;\n",
    "region_dummy(region_dummy==-1)=0; %Make it a proper dummy variable\n",
    "\n",
    "%Run the KSS correction and \"lincom\"\n",
    "labels_lincom={'Region 2 Dummy'}; %give me the label of the columns of Z.\n",
    "lincom_do=1; %tell the function leave_out_KSS that we want to project the firm effects on some Z.\n",
    "Z=region_dummy; %we're going to project the firm effects on a constant +  the region dummy. Constant automatically added by the code\n",
    "\n",
    "%Ready to call KSS with lincom option!\n",
    "[sigma2_psi,sigma_psi_alpha,sigma2_alpha] = leave_out_KSS(y,id,firmid,[],[],[],[],lincom_do,Z,labels_lincom);"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We can see from the output above (make sure to scroll until the end) that the difference in person-year weighted mean firm effects between the two regions is equal to 0.26 with a corresponding standard error of roughly 0.09. As shown in Table III of KSS, standard error estimates from two steps regressions that ignore dependence between the estimated fixed effects can yield highly misleading inferences."
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Matlab",
   "language": "matlab",
   "name": "matlab"
  },
  "language_info": {
   "codemirror_mode": "octave",
   "file_extension": ".m",
   "help_links": [
    {
     "text": "MetaKernel Magics",
     "url": "https://metakernel.readthedocs.io/en/latest/source/README.html"
    }
   ],
   "mimetype": "text/x-octave",
   "name": "matlab",
   "version": "0.16.11"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 4
}
